Data storage resource allocation using category blacklisting when data management requests fail

ABSTRACT

A resource allocation system begins with an ordered plan for matching requests to resources that is sorted by priority. The resource allocation system optimizes the plan by determining those requests in the plan that will fail if performed. The resource allocation system removes or defers the determined requests. In addition, when a request that is performed fails, the resource allocation system may remove requests that require similar resources from the plan. Moreover, when resources are released by a request, the resource allocation system may place the resources in a temporary holding area until the resource allocation system returns to the top of the ordered plan, so that lower priority requests that are lower in the plan do not take resources that are needed by waiting higher priority requests higher in the plan.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Divisional of U.S. patent application Ser. No. 14/804,446, entitled “Data Storage Resource Allocation Using Blacklisting of Data Storage Requests Classified in the Same Category as a Data Storage Request that is Determined to Fail if Attempted,” filed on Jul. 21, 2015, which is a Divisional of U.S. patent application Ser. No. 12/142,423, entitled “Data Storage Resource Allocation by Performing Abbreviated Resource Checks Based on Relative Chances of Failure of the Data Storage Resources to Determine Whether Data Storage Requests Would Fail,” filed on Jun. 19, 2008, and which are hereby incorporated by reference in their entireties herein.

BACKGROUND

Systems used to perform data storage operations of electronic data are growing in complexity. However, current systems may not be able to accommodate increased data storage demands or efficient and timely restore operations. Often, these systems are required to store large amounts of data (e.g. all of a company's data files) during a time period known as a “storage window.” The storage window defines a duration and actual time period when the system may perform storage operations. For example, a storage window may be for twelve hours, between 6 PM and 6 AM (that is, twelve non-business hours). Often, storage windows are rigid and unable to be modified. Therefore, when data storage systems attempt to store increasing data loads, they may need to do so without increasing the time in which they operate. Additionally, many systems perform daily stores, which may add further reliance on completing storage operations during allotted storage windows.

Moreover, each data storage operation requires multiple resources, such as access to a tape drive, allocation of a stream for that tape drive, an available tape on which to store data, a media agent computer to process and monitor the request, and so forth. Given multiple data storage requests and multiple resources, with each request requiring different resources for different periods of time, optimizing allocation of these resources can be a very complex operation as the number of requests and resources grows. Processor time can grow exponentially as the requests and resources grow.

Multidimensional resource allocation is an inherently complex problem to solve. As noted above, a number of disparate resources need to be available to satisfy a single request, such as available media, a compatible drive from a pool of drives, etc. Also, additional constraints must be satisfied, such as a load factor on a computer writing the data, a number of allowed agents or writers to a target media (e.g., disk or tape), etc.

Rules of resource allocation further complicate the problem. For example, rules may be established regarding failover such that when a given drive fails, the system can substitute in another drive. Likewise, rules may be established for load balancing so as not to overtax a given drive, but to spread the burden over a pool of drives. If a primary resource candidate is not available, then the system may allocate resources from an alternate resource pool, which may or may not be satisfactory. Time delay factors arise when alternatives are considered.

Furthermore, resource requests arrive in a random order; however, each incoming request has either a pre-assigned or dynamically changing priority. In addition, resources are freed up in a random order and may be associated with lower priority requests. A multiple set matching algorithm is not possible in such a complex environment.

In order to make a best match, a sorted list of requests is often maintained. This queue of requests is then walked and resources allocated to higher priority requests first before lower priority requests can be honored. The matching process for each request is very time consuming given the number of resources that must be made available for each job.

Prior systems have attempted to ameliorate these problems by reducing the number of variables and thereby reducing the complexity of such optimizations of resource allocations. Other systems have employed dedicated resources, often for higher priority requests. However, when those resources become freed up, they sit idle until other requests dedicated to those resources arrive. Other systems have solved this complexity problem by simply reducing the number of requests and creating smaller units of resources. This fragments a system, and can be inefficient.

Requests in a data management system often ultimately fail. For example, a required resource may be down or in short supply. Unfortunately, the data management system has often committed significant resources in the resource allocation process before the request fails. For example, the data management system may spend precious time gathering other resources or data only to discover that the tape drive to which the data should be copied is not available. This causes the data management system to waste time that reduces the amount of productive work that the system can perform during the storage window.

The foregoing examples of some existing limitations are intended to be illustrative and not exclusive. Other limitations will become apparent to those of skill in the art upon a reading of the Detailed Description below. These and other problems exist with respect to data storage management systems.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram illustrating an example of components used in data storage operations.

FIG. 1B is a block diagram illustrating an alternative example of components used in data storage operations.

FIG. 1C is a block diagram illustrating an alternative example of components used in data storage operations.

FIG. 2 is a block diagram illustrating an example of a data storage system.

FIG. 3 is a block diagram illustrating an example of components of a server used in data storage operations.

FIG. 4 is a flow diagram illustrating an example optimize requests routine.

FIG. 5 is a flow diagram illustrating an example handle requests routine.

FIG. 6 is a flow diagram illustrating an example build grid routine.

FIG. 7 is a block diagram illustrating an example environment in which a grid store may be applied.

In the drawings, the same reference numbers and acronyms identify elements or acts with the same or similar functionality for ease of understanding and convenience. To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the Figure number in which that element is first introduced (e.g., element 420 is first introduced and discussed with respect to FIG. 4).

The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed invention.

DETAILED DESCRIPTION

Described in detail herein is a resource allocation system that may employ one or more resource optimization subsystems or routines, which may be used individually or collectively to provide resource allocation in a data storage management environment. The system helps match resources in a multidimensional data storage management environment in a more efficient fashion, which permits highly scalable systems. As systems grow, a queue of jobs grows. For example, a queue may include thousands of data storage jobs, where each job is associated with multiple resource requests. Improving the allocation of resources enables more jobs to be performed within a particular time window, such as overnight or during off-peak hours for a business.

In some embodiments, the resource allocation system begins with an ordered plan for matching requests to resources that is sorted by priority. The initial priority scheme may be determined in a variety of ways. For example, an administrator may determine which items are most important and give those items a higher priority. As another example, the system may prioritize requests according to geographic or network routing, such as by establishing a preference for satisfying a request with the closest computer to a computer holding data for a request. The resource allocation system attempts to respect the input priority, and will prefer to complete higher priority requests before lower priority requests. The initial ordered plan forms a starting point that the system can refine by each of the methods described herein.

The resource allocation system uses three primary methods for matching resources to requests that are described in further detail in the following sections: Preordered/Intelligent Resource Checks, Category Blacklisting, and Resource Holding Area. The Dynamic Routing/Allocation section below describes additional techniques for matching requests to resources.

One resource optimization routine employs abbreviated pre-ordered matching where only a subset of resources is checked in a pre-ordered fashion. For example, physical resources such as drives are in short supply, so checks of these resources are quick given their small numbers, and these resources tend to fail more frequently. Thus, physical checks are done first, then logical checks are performed later, so that when a physical resource fails, the system saves the time it might have spent checking longer lists of logical resources. If a request fails based on a check of this shortened list, then the system moves to the next request in a queue.

As the requests for individual resources grow into the thousands or tens of thousands, and the number of resources grows to the hundreds, even an abbreviated matching list can consume considerable processing cycles. Therefore, a second resource optimization routine, category blacklisting, provides further efficiencies and scale to a queue of potentially thousands of jobs with their associated resource requests by assigning each request a category code, such as a storage policy. All resource requests having the same category code are associated with the same set of resource requests and rules that govern how to allocate those resources. Once a request is denied, the entire category is blacklisted or otherwise excluded from future resource allocation requests. Thus, subsequent requests in the same category may not even be matched or analyzed by the system based on a blacklisting flag associated with that request.

A third resource optimization routine employs a temporary holding area for released resources to preserve priorities. Since resources are released back into a pool of available resources at random times, the problem arises that a request having a lower priority will receive that resource ahead of a request having a higher priority. To overcome this shortcoming, released resources are held in a temporary holding area. When the resource allocation system loops back to the top of the queue (i.e., to those unfulfilled requests having the highest priority), the newly available resources from the temporary holding area are added back to the general pool of available resources. This helps ensure that requests having higher priority will have access to any released resources before lower priority requests.

Various examples of the system will now be described. The following description provides specific details for a thorough understanding and enabling description of these examples. One skilled in the art will understand, however, that the system may be practiced without many of these details. Additionally, some well-known structures or functions may not be shown or described in detail, so as to avoid unnecessarily obscuring the relevant description of the various examples.

The terminology used in the description presented below is intended to be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of the system. Certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this Detailed Description section.

Suitable System

Referring to FIG. 1A, a block diagram illustrating components of a data stream is shown. The stream 110 may include a client 111, a media agent 112, and a secondary storage device 113. For example, in storage operations, the system may store, receive and/or prepare data to be stored, copied or backed up at a server or client 111. The system may then transfer the data to be stored to media agent 112, which may then refer to storage policies, schedule policies, and/or retention policies (and other policies), and then choose a secondary storage device 113 for storage of the data. Secondary storage devices may be magnetic tapes, optical disks, USB and other similar media, disk and tape drives, and so on.

Referring to FIG. 1B, a block diagram illustrating components of multiple selectable data streams is shown. Client 111 and any one of multiple media agents 112 may form a stream 110. For example, one stream may contain client 111, media agent 121, and storage device 131, while a second stream may use media agent 125, storage device 133, and the same client 111. Additionally, media agents may contain additional subpaths 123, 124 that may increase the number of possible streams for client 111. Examples of subpaths 123, 124 include host bus adapter (HBA) cards, Fibre Channel cards, SCSI cards, and so on. Thus, the system is able to stream data from client 111 to multiple secondary storage devices 113 via multiple media agents 112 using multiple streams.

Referring to FIG. 1C, a block diagram illustrating components of alternative multiple selectable data streams is shown. In this example, the system may transfer data from multiple media agents 151, 152 to the same storage device 113. For example, one stream may be from client 141, to media agent 151, to secondary storage device 113, and a second stream may be from client 142, to media agent 152, to secondary storage device 113. Thus, the system is able to copy data to one secondary storage device 113 using multiple streams 110.

Additionally, a stream may be from one client to two media agents and to one storage device. Of course, the system may employ other configurations of stream components not shown in the Figures.

Referring to FIG. 2, a block diagram illustrating an example of a data storage system 200 is shown. Data storage systems may contain some or all of the following components, depending on the needs of the system.

For example, the data storage system 200 contains a storage manager 210, one or more clients 111, one or more media agents 112, and one or more storage devices 113. Storage manager 210 controls media agents 112, which may be responsible for transferring data to storage devices 113. Storage manager 210 includes a jobs agent 211, a management agent 212, a database 213, and/or an interface module 214. Storage manager 210 communicates with client(s) 111. One or more clients 111 may access data to be stored by the system from database 222 via a data agent 221. The system uses media agents 112, which contain databases 231, to transfer and store data into storage devices 113. Client databases 222 may contain data files and other information, while media agent databases 231 may contain indices and other data structures that assist and implement the storage of data into secondary storage devices, for example.

The data storage system may include software and/or hardware components and modules used in data storage operations. The components may be storage resources that function to copy data during storage operations. The components may perform storage operations (or storage management operations) other than operations used in data stores. For example, some resources may create, store, retrieve, and/or migrate primary or secondary data copies. The data copies may include snapshot copies, backup copies, HSM copies, archive copies, and so on. The resources may also perform storage management functions that may communicate information to higher level components, such as global management resources.

In some examples, the system performs storage operations based on storage policies, as mentioned above. For example, a storage policy includes a set of preferences or other criteria to be considered during storage operations. The storage policy may determine or define a storage location and/or set preferences about how the system transfers data to the location and what processes the system performs on the data before, during, or after the data transfer. Storage policies may be stored in storage manager 210, or may be stored in other resources, such as a global manager, a media agent, and so on. Further details regarding storage management and resources for storage management will now be discussed.

Referring to FIG. 3, a block diagram illustrating an example of components of a server used in data storage operations is shown. A server, such as storage manager 210, may communicate with clients 111 to determine data to be copied to primary or secondary storage. As described above, the storage manager 210 may contain a jobs agent 211, a management agent 212, a database 213, and/or an interface module 214. Jobs agent 211 may manage and control the scheduling of jobs (such as copying data files) from clients 111 to media agents 112. Management agent 212 may control the overall functionality and processes of the data storage system, or may communicate with global managers. Database 213 or another data structure may store storage policies, schedule policies, retention policies, or other information, such as historical storage statistics, storage trend statistics, and so on. Interface module 214 may interact with a user interface, enabling the system to present information to administrators and receive feedback or other input from the administrators or with other components of the system (such as via APIs).

The server 300 contains various components of the resource allocation system such as a receive requests component 310, an optimize requests component 320, a handle requests component 330, a request store component 340, a resource store component 350, and a build grid component 360. The receive requests component 310 provides input to the resource allocation system in the form of an initial priority ordered plan (e.g., a list or queue) that specifies requests to be performed by the system. The optimize requests component 320 modifies the initial plan by performing one or more optimization routines on the initial plan. For example, the resource allocation system may perform abbreviated checks to remove requests from the initial plan that would fail. The handle requests component 330 dispatches, performs or at least initializes each request and performs other optimizations based on the result of each request. For example, when a request fails, the handle requests component 330 may remove similar requests from the plan, such as by blacklisting categories of requests. The request store component 340 stores the request plan and supplemental information about each request, such as any associated storage policy or metadata. The resource store component 350 stores information about the resources available throughout the system, such as drives, networks, media, and so forth. The build grid component 360 builds a grid of resources and requests by subdividing resources and requests according to certain criteria, such as geographic location of resources. These components and their operation are described in further detail herein.

Storage operations cells may be organized in a hierarchical fashion to perform various storage operations within a network. Further details on such a hierarchical architecture may be found in U.S. Patent Application No. 11/120,662, entitled Hierarchical Systems and Methods for Providing a Unified View of Storage Information, filed May 2, 2005 (attorney document no. 60692.8019), which is hereby incorporated herein by reference.

Preordered/Intelligent Resource Checks

The resource allocation system or process matches pending requests to resources by performing abbreviated checks to determine if a request would succeed. If a check fails, then the resource allocation system can move on to the next request, having spent less time than if the system began performing the data storage operation specified by the request in full (e.g., not abbreviated) and the data storage operation failed. The resource allocation system performs the abbreviated checks in a preordered fashion. For example, physical checks (e.g., determining whether the necessary hardware is working) may be performed before logical checks (e.g., determining whether a device is in use). An underlying goal is to check requests that have the highest chance of failure to remove them from the plan first. In addition, since many requests have dependent requests, when the resource allocation system can eliminate a request from the plan, it can also remove each of the request's dependent requests. Requests that fail the checks would have failed anyway, so determining that before resources have been committed and time has been wasted allows the system to spend more time working on requests that are more likely to succeed.
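As a non-limiting illustration of removing a request together with its dependent requests, the following Python sketch assumes a hypothetical mapping from each request identifier to the identifiers of the requests that depend on it. The function name `remove_with_dependents`, the `dependents` mapping, and the sample request identifiers are assumptions made for illustration only and are not part of the system described herein.

```python
from collections import deque


def remove_with_dependents(plan, dependents, failed_id):
    """Return the plan without failed_id and without any request that
    depends on it, directly or through other dependent requests."""
    to_remove = set()
    pending = deque([failed_id])
    while pending:
        current = pending.popleft()
        if current in to_remove:
            continue  # already handled; also guards against cycles
        to_remove.add(current)
        pending.extend(dependents.get(current, ()))
    # Preserve the original priority order of the surviving requests.
    return [request_id for request_id in plan if request_id not in to_remove]


# Hypothetical plan in priority order and a hypothetical dependency map.
plan = ["backup-db", "index-db", "verify-index", "backup-mail"]
dependents = {"backup-db": ["index-db"], "index-db": ["verify-index"]}
pruned = remove_with_dependents(plan, dependents, "backup-db")
```

In this sketch, a failed abbreviated check on "backup-db" also removes "index-db" and "verify-index", leaving only "backup-mail" in the plan.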

In one example, since physical resources are often the most difficult to acquire or fulfill, the matching routine may analyze a current job in a queue and first recognize that one resource being requested is a tape drive. Typically, physical resources like disks or tape drives are in short supply. The matching routine may perform a physical check to see whether that physical resource (tape drive) is offline or malfunctioning before doing a logical check of that resource (e.g., whether it is busy with another job). Likewise, the matching process may look at the state of the network to determine whether a path or stream is possible to satisfy a currently analyzed job. Physical checks are often faster than logical checks as they operate on a smaller list of available resources and tend to fail more often. When a physical check fails, the resource allocation system saves time that might have been wasted if a longer logical check had been performed first.

The following is an example of physical checks that may be performed by the resource allocation system in the following order:

1) availability of drives (e.g., tape, optical, or magnetic),
2) availability of device streams that can be used on the storage policy,
3) availability of a tape library, availability of a media agent,
4) availability of a host bus adapter card, number of drives that can be used in parallel from a tape library (e.g., a drive allocation policy for a library),
5) number of drives that can be used on a single media agent at a given time (e.g., a drive allocation policy for a drive pool),
6) availability of tapes or other media, and so forth.

The resource allocation system next performs logical checks. For example, the system may determine whether a tape drive is busy or flagged “do not use.” Some devices may be reserved for other purposes outside of the resource allocation system, for example when an administrator indicates that the resource allocation system should not schedule requests to use those devices.
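The ordering of physical checks before logical checks may be sketched as follows. This Python example is illustrative only: the `Drive` record, its fields, and the function names are assumptions and do not describe an actual data model of the resource allocation system.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    online: bool = True        # physical state: hardware is working
    busy: bool = False         # logical state: in use by another job
    do_not_use: bool = False   # logical state: reserved by an administrator


def drive_is_physically_ok(drive):
    return drive.online


def drive_is_logically_free(drive):
    return not drive.busy and not drive.do_not_use


# Checks run in this fixed order, so a physical failure ends the
# evaluation before any logical check is made.
ORDERED_CHECKS = [
    ("physical", drive_is_physically_ok),
    ("logical", drive_is_logically_free),
]


def first_failed_check(drive):
    """Return the label of the first check that fails, or None."""
    for label, check in ORDERED_CHECKS:
        if not check(drive):
            return label
    return None
```

For example, a drive that is both offline and busy is reported as a physical failure, because the logical check is never reached.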

In some embodiments, the resource allocation system compares only a subset of resource requests for that job to available resources to determine whether that job can be satisfied. A resource allocation system or process may include an abbreviated pre-order matching routine that identifies a subset of resource requests and performs a matching or allocation based on only that subset, rather than the entire set of resources. Notably, if the matching routine determines, based on an ordered review of a subset of resource requests in a given job, that that job cannot be satisfied, then the matching routine may stop any further matching operations for that job and simply move to the next job in the queue. This saves valuable resources in determining how to optimize each job (with its associated resource requests) in the queue. Rather than try to satisfy requests, the matching routine looks to eliminate jobs that request resources that cannot currently be satisfied.

Thus, the matching routine in one example reviews each incoming job in the queue, in order, and determines whether that job can be skipped, ignored or excluded during a current cycle through the queue. The matching routine compares the various resource requests in the job to the resources above to see whether that current job can be ignored. For example, a current job analyzed by the matching routine may request the following resources: an available tape drive, a device stream for a storage policy, access to a tape library, a particular media agent, and an available tape. The matching routine first determines whether a tape drive is physically available, and if so, determines whether a device stream to that tape drive for the storage policy is available. If not, then the matching routine does not look to determine whether any other resources in the job are available, but simply moves to the next job in the queue.
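The tape drive and device stream example above may be sketched in Python as follows. The resource labels, the `CHECK_ORDER` list, the `can_skip` function, and the availability counts are hypothetical and chosen only to mirror the example. An abbreviated check could shorten `CHECK_ORDER` to a subset of the resources.

```python
# Hypothetical check order, most failure-prone resources first.
CHECK_ORDER = ["tape_drive", "device_stream", "tape_library",
               "media_agent", "tape"]


def can_skip(job_requests, available, checked_log=None):
    """Return True if the job should be skipped in the current cycle.

    job_requests: set of resource labels the job needs.
    available: mapping from resource label to a count of free units.
    checked_log: optional list that records which resources were checked.
    """
    for resource in CHECK_ORDER:
        if resource not in job_requests:
            continue
        if checked_log is not None:
            checked_log.append(resource)
        if available.get(resource, 0) < 1:
            return True  # stop here; remaining resources are not examined
    return False


job = {"tape_drive", "device_stream", "tape_library", "media_agent", "tape"}
available = {"tape_drive": 1, "device_stream": 0, "tape_library": 1,
             "media_agent": 1, "tape": 4}
checked = []
skip = can_skip(job, available, checked)
```

Here the job is skipped after only two checks, since the tape drive is available but no device stream is, and the tape library, media agent, and tape are never examined.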

FIG. 4 is a flow diagram that illustrates the processing of the optimize requests component 320, in one embodiment. The optimize requests component 320 is invoked when the resource allocation system receives an initial plan containing a prioritized list of data storage requests. The plan may come from a variety of sources. For example, an administrator may create the initial plan that is composed of each of the storage requests that the administrator wants a data storage system to perform. As another example, each computer in an organization may automatically register a nightly request to archive files or perform other data storage operations, and a data storage system may collect the requests. In block 410, the component receives the initial list of requests. The list of requests is in priority or other order, based on requirements of the data storage system for completing the requests. For example, an administrator may order requests based on the importance that each request be completed. In block 420, the component performs the abbreviated checks described above. For example, the component may perform a series of physical checks for each request followed by a series of logical checks (and the checks may only check a subset of the resources required by the request). When a request fails a particular check, the request is removed from the list or marked as not to be performed. In block 430, the component handles each request in turn by traversing the list as described in further detail with respect to FIG. 5.
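The flow of blocks 410 through 430 may be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Request` record, the convention that a lower number means a higher priority, and the sample `physical_check` and `logical_check` functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: int          # assumption: lower number = higher priority
    needs: tuple           # names of resources the request requires
    skip: bool = False     # set when an abbreviated check fails


def optimize_requests(requests, checks):
    # Block 410: receive the list and put it in priority order.
    plan = sorted(requests, key=lambda r: r.priority)
    # Block 420: run the abbreviated checks in the order given.
    for request in plan:
        for check in checks:
            if not check(request):
                request.skip = True  # mark as not to be performed
                break
    # Block 430 would handle the requests that remain.
    return [r for r in plan if not r.skip]


OFFLINE = {"drive-2"}   # hypothetical physical state
BUSY = {"drive-3"}      # hypothetical logical state


def physical_check(request):
    return all(name not in OFFLINE for name in request.needs)


def logical_check(request):
    return all(name not in BUSY for name in request.needs)


requests = [Request("c", 3, ("drive-3",)),
            Request("a", 1, ("drive-1",)),
            Request("b", 2, ("drive-2",))]
runnable = optimize_requests(requests, [physical_check, logical_check])
```

In this example only request "a" remains, because "b" fails the physical check and "c" fails the logical check.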

Under the abbreviated or preordered/intelligent check of resources, the system performs one or more abbreviated checks by performing a selected subset of one or more checks for whether one or more selected data storage resources are available to satisfy a data storage request. As noted above, the selected one data storage resource may be a physical data storage resource, rather than a logical data storage resource. When the one or more abbreviated checks indicate that the data storage request would fail if performed, then the system updates the list of data storage requests to indicate that that data storage request should not be performed, without attempting to perform that data storage request and without performing a check of all data storage resources required for performing that data storage request. Thereafter, the system again performs one or more abbreviated checks on the next data storage request in the list of data storage requests, and updates the list when the newly performed abbreviated checks indicate that this next data storage request would fail if performed.

When the abbreviated pre-ordered checks are complete, the resource allocation system will have reduced the initial ordered plan by removing requests from the plan that would eventually fail if allowed to continue. The number of requests removed can be significant, leading to significant time savings and more time for the resource allocation system to spend on requests that will likely succeed.

Category Blacklisting

The second method the resource allocation system uses for matching resources to requests is called category blacklisting. As the request count and number of resources grow, even the abbreviated checks described above may take a long time. Category blacklisting attempts to make the process of matching resources to requests more efficient so that the system scales further. This process is described in further detail with reference to FIG. 5.

In some embodiments, each request has a category code. For example, the category code may be associated with a storage policy. All requests in the same category have similar resource requirements and rules governing how to allocate the resources. When the resource allocation system denies a request, either during abbreviated checking or by actually attempting to perform the request and reaching a point of failure, then the resource allocation system blacklists the entire category. If one request in the category fails then all of the requests in the category are likely to fail. By removing these other requests, the resource allocation system prevents wasting time on failed requests and has more time available to perform requests that will likely succeed.
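Category blacklisting during a single traversal of the plan may be sketched as follows. The `walk_plan` function, the `attempt` callback, and the policy names used as category codes are assumptions made for illustration.

```python
def walk_plan(plan, attempt):
    """Traverse the plan once, blacklisting a category on its first denial.

    plan: list of (request_id, category_code) pairs in priority order.
    attempt: callable returning True if the request succeeds.
    """
    blacklisted = set()
    started, skipped = [], []
    for request_id, category in plan:
        if category in blacklisted:
            skipped.append(request_id)  # no matching or checks are performed
            continue
        if attempt(request_id):
            started.append(request_id)
        else:
            blacklisted.add(category)
    return started, skipped, blacklisted


attempted = []


def attempt(request_id):
    attempted.append(request_id)
    return request_id != "r2"   # hypothetical: only r2 is denied


plan = [("r1", "policy-A"), ("r2", "policy-B"),
        ("r3", "policy-A"), ("r4", "policy-B")]
started, skipped, blacklisted = walk_plan(plan, attempt)
```

After request "r2" is denied, request "r4" in the same category is skipped without being attempted, while requests in the other category proceed normally.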

In some embodiments, the resource allocation system determines a likeness factor that indicates a degree of similarity between requests assigned to a particular category. If the resource allocation system denies a request, then the resource allocation system removes other requests that have a high enough likeness factor from the plan. The likeness factor considers a subset of resources that the request needs, and may separate the resources into common and unique subsets. The resource allocation system may take two or three of the most common resources to compose the likeness factor. If a request is denied for a common resource, then the resource allocation system blacklists the category. However, if the request is denied based on a unique resource, then it is less likely that the entire category will suffer the same failure, and the resource allocation system may not blacklist the category.
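The distinction between common and unique resources may be sketched as follows. The description above does not specify a formula for the likeness factor, so this Python example only illustrates one possible way to pick the most common resources in a category and to decide whether a denial should blacklist the category. The function names, the `top_n` parameter, and the sample resource names are hypothetical.

```python
from collections import Counter


def common_resources(category_requests, top_n=2):
    """Return the top_n resources needed by the most requests in a category.

    category_requests: mapping from request id to the resources it needs.
    """
    counts = Counter(
        resource
        for needs in category_requests.values()
        for resource in set(needs)   # count each resource once per request
    )
    return {resource for resource, _ in counts.most_common(top_n)}


def should_blacklist(category_requests, denied_resource, top_n=2):
    """Blacklist the category only if the denial involved a common resource."""
    return denied_resource in common_resources(category_requests, top_n)


category = {
    "r1": ["tape_drive", "media_agent", "tape-0017"],
    "r2": ["tape_drive", "media_agent", "tape-0042"],
    "r3": ["tape_drive", "media_agent", "tape-0099"],
}
```

In this example, a denial caused by the shared tape drive would blacklist the category, whereas a denial caused by the single tape "tape-0042" would not, since only one request needs that tape.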

When the category blacklisting process is complete, the resource allocation system will have further reduced the ordered plan by removing requests from the plan that would eventually fail if allowed to continue. The number of requests removed can be significant, leading to significant time savings and more time for the resource allocation system to spend on requests that will likely succeed.

Resource Holding Area

To preserve priorities in the job queue, the resource allocation system may employ a temporary holding area for released resources. As jobs are completed, resources are freed up to be available for subsequent job requests. However, as noted above, the freed up resources could be provided to lower priority jobs. Therefore, when resources are freed up, the resource allocation system places them into a logical holding area. They remain in the holding area until the resource allocation system has walked the queue and returns to the top of the queue. At that point, the resource allocation system starts afresh and may satisfy highest priority job requests with all available resources, including those that had been freed up while walking the queue and placed temporarily in the holding area.

In general, the resource allocation system operates in a loop, repeatedly traversing the ordered plan of requests and initiating each request in turn as the resources required by the request become available. When the end of the plan is reached, the system starts again from the top. However, a problem arises when a request with lower priority attempts to grab a resource ahead of a request with higher priority. This can happen when, while the system is traversing the plan, a request completes and releases resources whose unavailability caused a request earlier in the plan to be postponed. Thus, when resources are released by a request, the resource allocation system may place these resources into a temporary resource holding area so that they are not yet available to subsequent requests in the plan. When the resource allocation loop returns to the top of the plan, the resource allocation system makes these resources available to all requests. This ensures that requests with highest priority will have access to any released resources first.
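
The holding area behavior described above can be sketched roughly as follows. The class and method names (`ResourcePool`, `acquire`, `release`, `end_of_pass`) are hypothetical and chosen only for this illustration.

```python
class ResourcePool:
    """Available resources plus a holding area for resources released mid-pass."""

    def __init__(self, available):
        self.available = set(available)
        self.held = set()  # temporary resource holding area

    def acquire(self, needed):
        """Take all needed resources if every one is currently available."""
        needed = set(needed)
        if needed <= self.available:
            self.available -= needed
            return True
        return False

    def release(self, resources):
        """Released resources go to the holding area, not straight back."""
        self.held |= set(resources)

    def end_of_pass(self):
        """Back at the top of the plan: make held resources available to all."""
        self.available |= self.held
        self.held.clear()
```

Under these assumptions, a resource released partway through a traversal cannot be acquired by a request later in the plan; it becomes visible again only after `end_of_pass()`, at which point the highest priority request is the first to see it.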

In some embodiments, the resource allocation system receives configuration information from an administrator that alters the behavior of the resource holding area. For example, the configuration information may indicate that the resource holding area is not used until the resource allocation system is at a certain depth in the plan or queue. As an example, if the resource allocation system is at the third request in the plan and the second request released a resource that the third request needs, then the resource allocation system may go ahead and give the resource to the third request rather than waiting until the entire plan has been traversed. This early in the plan, it is less likely that any of the prior higher priority requests need the resource, or the resource may be available again by the time the resource allocation system has traversed the entire plan.

The resource allocation system may include an administrator interface (not shown) that allows a system administrator to configure the holding area. For example, if a resource is freed up while the system is analyzing the first quarter or tenth of the queue, then it may be immediately placed back into the pool of available resources, rather than waiting for the entire queue to be walked. As another example, if the resource allocation system is currently analyzing the third request in the queue, and that third request may be satisfied with a resource in the temporary holding area, then the system may allocate that resource from the temporary holding area. Alternatively or additionally, the user interface may permit the system administrator to otherwise allow a resource from the temporary holding area to be placed into the available resource pool based on wait times or otherwise override default operation.
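
One way such administrator settings might be represented is sketched below. The option names (`bypass_fraction`, `max_hold_seconds`) and their default values are assumptions for illustration; the description does not name specific configuration options.

```python
from dataclasses import dataclass


@dataclass
class HoldingAreaConfig:
    # Hypothetical settings; not named in the description.
    bypass_fraction: float = 0.25    # leading portion of the queue where holding is skipped
    max_hold_seconds: float = 600.0  # return a held resource to the pool after this wait


def should_hold(position, queue_length, config):
    """Hold a freed resource only once the walk is past the configured depth.

    position is the zero-based index of the request currently being analyzed.
    """
    if queue_length == 0:
        return False
    return position / queue_length >= config.bypass_fraction


def hold_expired(held_since, now, config):
    """True when a held resource has waited long enough to be returned early."""
    return now - held_since >= config.max_hold_seconds
```

With the default values assumed here, a resource freed while the system is analyzing the first quarter of the queue goes straight back to the available pool, and a held resource is returned early once it has waited ten minutes.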

FIG. 5 illustrates the processing of the handle requests component 330 and/or 430, in one embodiment. The component is invoked to traverse the list of requests and perform each request. In block 510, the component retrieves the first request from the list. In block 520, the component dispatches the request. Those of ordinary skill in the art will recognize that dispatching a request may include many steps, such as retrieving media from a media library, loading the media in a drive, gathering data from one or more computer systems, performing data management operations on the data (e.g., compression, encryption, copying, etc.), and so forth. In decision block 530, the component determines whether the requested resource has been blacklisted by comparing the request to the generated blacklist table. If so, the routine gets the next request in block 580 and loops back to dispatch that request in block 520.

If the requested resource is not blacklisted, then in block 535 the component performs one or more abbreviated checks, as described above with respect to FIG. 4. If the check fails, then the component continues at block 540, else the component continues at block 545. In block 540, the component determines the category of the failed request and adds the category to a blacklist so that other requests that belong to the category will be flagged as likely to fail and thus not be dispatched by the resource allocation system. In some embodiments, the resource allocation system first determines whether the resource that caused the request to fail is common (e.g., such that other similar resources are readily available) or unique, and blacklists the category only if the resource is a unique resource.

In decision block 545, the component performs a full check. If the full check fails, the routine again loops back to add the category of the failed request to the blacklist (block 540), get the next request (block 580), and dispatch that request (block 520). If the full check succeeds, then the resource request is granted in block 550.

In decision block 555, if the request released resources, then the component continues at block 560, else the component continues at block 570. In block 560, the component places the released resources in a temporary resource holding area for release after the entire list has been traversed. This prevents lower priority requests at the end of the list from grabbing resources that higher priority requests at the beginning of the list are waiting for. In decision block 570, if there are more requests in the list, then the component continues at block 580, else the component continues at block 590. In block 580, the component retrieves the next request from the list and loops to block 520 to dispatch the request. Blocks 510-580 depict dispatching requests serially (e.g., one after the other) in order to simplify the illustration of the operation of the resource allocation system. Those of ordinary skill in the art will recognize that in practice many requests can be performed in parallel. In block 590, the component has reached the end of the list and releases any temporarily held resources into a global resource pool for access by any request. The component then loops to block 510 to traverse the list again. Typically the resource allocation system will loop repeatedly performing each request remaining in the list until either the list of requests is exhausted or the end of the storage operation window is reached. For example, the system may stop performing requests at the end of a nightly storage window.
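
A single traversal of the list, loosely following blocks 510-590, might be sketched as follows. The request representation (dictionaries with `id`, `category`, and `needs` keys), the function name `traverse_once`, and the caller-supplied `quick_check` and `full_check` callables are all assumptions of this sketch. Requests are handled serially, and each granted request is treated as finishing immediately, purely to keep the example short.

```python
def traverse_once(plan, available, blacklist, quick_check, full_check):
    """One pass over the plan in priority order; returns ids granted this pass.

    plan: list of dicts with keys "id", "category", "needs" (a set).
    available: set of free resources, updated in place.
    blacklist: set of blacklisted categories, updated in place.
    """
    held = set()      # temporary resource holding area (block 560)
    granted = []
    for request in plan:                                # blocks 510 / 580
        if request["category"] in blacklist:            # block 530
            continue
        if not quick_check(request):                    # block 535
            blacklist.add(request["category"])          # block 540
            continue
        if not full_check(request, available):          # block 545
            blacklist.add(request["category"])          # block 540
            continue
        granted.append(request["id"])                   # block 550
        available.difference_update(request["needs"])
        # Simplification: the request completes at once, and its resources
        # wait in the holding area rather than returning to the pool.
        held.update(request["needs"])                   # blocks 555-560
    available.update(held)                              # block 590
    return granted
```

A caller would typically remove the granted requests from the plan and invoke the function again until the plan is empty or the storage window ends.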

Dynamic Routing/Allocation

Often resources are far-flung, both geographically and throughout a network. For example, a company may have multiple offices, each with its own LAN and data management resources. The company may want to perform periodic data management operations at each of the offices. For example, the company may back up user workstations at every office on a weekly basis.

In some embodiments, the resource allocation system arranges the resources needed by requests in a grid, table, or other structure to match each request with the set of resources that can best satisfy it. For example, although any tape library accessible by a user workstation could be used to satisfy a request to back up the user workstation, it is likely desirable to use a tape library that is close to the user workstation (where close could include geographic or network proximity). The resource allocation system may group the data management operations based on a created logical table or grid allocation criteria. For example, the criteria may specify different subsets of data that are important to a particular user of the system. For example, a bank may divide data by branch location or other criteria, or a government office may divide data by the level of security clearance needed to access the network on which the data is originally stored.

In some embodiments, the grid indicates alternate data paths that the resource allocation system can use if a preferred resource fails or is unavailable. The data path may specify a particular network route, tape library, media agent, or other hardware or software to use if that data path is selected to satisfy a request. For example, if there is a failover or load issue during the completion of a request, then the resource allocation system can attempt to use one of the alternate data paths. The resource allocation system may determine if all of the elements of a particular data path are available before choosing the alternate data path.

In some embodiments, the resource allocation system receives rules that specify how alternate data paths are chosen. For example, an overflow rule may specify that when a particular data path or hardware resource is full, requests for the data path should fail over to a second specified data path. Alternatively or additionally, a rule may specify that the resource allocation system should select data paths in an alternating or round robin fashion to spread the load of requests evenly across data paths, or should favor lesser used resources to spread wear evenly among all components.
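
The two kinds of rule described above might be sketched as follows. The function names and the representation of a data path as a tuple of element names are assumptions made for this illustration.

```python
from itertools import cycle


def first_available_path(paths, is_available):
    """Overflow-style rule: use the preferred path, else fail over in order.

    paths is ordered from most to least preferred. A path is usable only
    if every element on it (network route, library, media agent, ...) is
    available.
    """
    for path in paths:
        if all(is_available(element) for element in path):
            return path
    return None


def round_robin(paths):
    """Round robin rule: hand out paths in rotation to spread the load."""
    return cycle(paths)
```

Here `first_available_path` returns `None` when no listed path is fully available, leaving it to the caller to defer or deny the request.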

FIG. 6 is a flow diagram that illustrates the processing of the build grid component 360, in one embodiment. In block 610, the component receives an initial request plan as described above. In block 620, the component determines the grid dimensions. For example, the component may group the resources in columns according to common characteristics of the resources, such as geographic proximity, network topology, departments within an organization, and so forth. Then, rows of requests may be assigned to each column. In block 630, the component assigns requests to the appropriate cell within the grid. For each request, the component may assign the request based on the resources required by the request or other configured criteria. The component may also determine an alternate set of resources for a request to be used when a first set of resources is not available. The methods described above for optimizing the allocation of requests to resources may be used to optimize the entire grid or columns of resources within the grid. In block 640, the component dispatches the requests within the grid. Each column of the grid may operate independently on separate lists of requests, and the component may process each list in parallel as well as performing operations within each list in parallel.
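
A minimal sketch of blocks 620-640 follows, assuming each request is a dictionary and that a caller-supplied function (`column_of`, hypothetical) maps a request to its grid column, for example by office. The thread pool is only one possible way to process columns in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_grid(plan, column_of):
    """Group requests into columns; each column keeps the plan's priority order."""
    grid = defaultdict(list)
    for request in plan:  # plan is assumed to be sorted by priority already
        grid[column_of(request)].append(request)
    return dict(grid)


def dispatch_grid(grid, run_column):
    """Process each column's list of requests independently and in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {col: pool.submit(run_column, reqs) for col, reqs in grid.items()}
        return {col: fut.result() for col, fut in futures.items()}
```

Because each column holds its own list, the request reduction techniques described earlier could be applied by `run_column` to one column at a time.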

Alternatively or additionally, the grid may include a row or column of priorities associated with each data management operation in a list of data management operations. The system may then perform data management operations along one dimension of the grid in priority order. Overall, the resource allocation system provides a highly scalable data storage system.

The resource allocation system can use the division of resources and requests into a grid to further subdivide the problem of matching requests to resources. Within each dimension of the grid, the resource allocation system may apply the techniques described above for reducing the number of requests to be performed or the resources used for completing those requests. Thus, the grid store provides additional scalability for performing data management operations.

FIG. 7 is a block diagram that illustrates an example environment in which the grid store may be applied. An organization may have three offices: Office A 710, Office B, and Office C, which are connected by one or more networks, such as LAN 730 and WAN 760 (e.g., the Internet). Each office has its own computer systems and data storage devices (e.g., tape libraries, media agents, disk drives, and so forth). For example, Office A 710 has one or more user workstations 715 and 720 and one or more data storage devices such as device 725, whereas Office B contains an E-Commerce server 745 (such as a website) and a database server 750. Office B also has one or more data storage devices, such as device 755. Office C has an ERP system 775 and email server 780 and one or more data storage devices, such as device 785. Typically, it is desirable to allow the data storage devices at each location to handle the data storage requests for that location. For example, the data storage device 725 is typically the preferred device for handling data storage requests related to workstation 715. However, because each office is connected by a network, it is also possible for data storage device 755 to handle data storage requests related to workstation 715. The resource allocation grid divides requests among the available resources, tracking the types of preferences described. In addition, the resource allocation grid may specify one or more alternate data paths to be used when the preferred data path or data storage device is not available.

Overall, this system receives data storage requests to be performed in, e.g., multiple, geographically separated locations, wherein each of the locations includes separate data storage resources. For each data storage request, and before receiving data storage requests to be executed, the system determines at least two different sets of data storage resources to handle the request. The first set of data storage resources is a preferred set of resources to handle the request, while the second set is an alternate set of data storage resources to handle the request. Before receiving data storage requests to be executed, the system establishes a data storage resource allocation based at least in part on one of the sets of data storage resources determined to handle each data storage request.

Likewise, the system receives a list of multiple data management resources for use in performing multiple data management operations and intelligently allocates those resources. For at least some of the multiple data management resources, the system determines characteristics of the data management resource (e.g. geographic proximity, network topology, departments within a business organization, type of hardware, etc.) and groups at least two of the resources based on a similar characteristic. For each data management operation in the list, the system then determines a location of one or more data objects affected by the data management operation, and creates a logical table for allocating the data management resources. This logical table for resource allocation selects one or more data management resources to perform each of the multiple data management operations in the list, based at least in part on the determined physical location of the one or more data objects affected by the data management operation, and on the grouping of at least two of the data management resources based on a similarity of determined characteristics. Thereafter the system may perform each data management operation based on the created logical table for resource allocation.

CONCLUSION

The resource allocation system described herein provides a tremendous amount of flexibility. It is applicable to heterogeneous networks and systems, as well as systems that grow in size and diversity. Thus, a data management system with only a few localized cells may grow 10 or 100 times larger with multiple geographically dispersed cells, without compromising on backup windows and resource allocation times. For example, prior systems could accommodate 300 resource allocations in an hour, whereas the present system could handle over 40,000.

Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein. Software and other modules may reside on servers, workstations, personal computers, computerized tablets, PDAs, and other devices suitable for the purposes described herein. In other words, the software and other modules described herein may be executed by a general-purpose computer, e.g., a server computer, wireless device or personal computer. Those skilled in the relevant art will appreciate that aspects of the invention can be practiced with other communications, data processing, or computer system configurations, including: Internet appliances, hand-held devices (including personal digital assistants (PDAs)), wearable computers, all manner of cellular or mobile phones, multi-processor systems, microprocessor-based or programmable consumer electronics, set-top boxes, network PCs, mini-computers, mainframe computers, and the like. Indeed, the terms “computer,” “server,” “host,” “host system,” and the like are generally used interchangeably herein, and refer to any of the above devices and systems, as well as any data processor. Furthermore, aspects of the invention can be embodied in a special purpose computer or data processor that is specifically programmed, configured, or constructed to perform one or more of the computer-executable instructions explained in detail herein.

Software and other modules may be accessible via local memory, via a network, via a browser or other application in an ASP context, or via other means suitable for the purposes described herein. Examples of the technology can also be practiced in distributed computing environments where tasks or modules are performed by remote processing devices, which are linked through a communications network, such as a Local Area Network (LAN), Wide Area Network (WAN), or the Internet. In a distributed computing environment, program modules may be located in both local and remote memory storage devices. Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein may comprise elements from graphical user interfaces, command line interfaces, and other interfaces suitable for the purposes described herein. Screenshots presented and described herein can be displayed differently as known in the art to input, access, change, manipulate, modify, alter, and work with information.

Examples of the technology may be stored or distributed on computer-readable media, including magnetically or optically readable computer discs, hard-wired or preprogrammed chips (e.g., EEPROM semiconductor chips), nanotechnology memory, biological memory, or other data storage media. Indeed, computer implemented instructions, data structures, screen displays, and other data under aspects of the invention may be distributed over the Internet or over other networks (including wireless networks), on a propagated signal on a propagation medium (e.g., an electromagnetic wave(s), a sound wave, etc.) over a period of time, or they may be provided on any analog or digital network (packet switched, circuit switched, or other scheme).

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to.” As used herein, the terms “connected,” “coupled,” or any variant thereof, mean any connection or coupling, either direct or indirect, between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof. Additionally, the words “herein,” “above,” “below,” and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application. Where the context permits, words in the above Detailed Description using the singular or plural number may also include the plural or singular number respectively. The word “or,” in reference to a list of two or more items, covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.

While certain aspects of the technology are presented below in certain claim forms, the inventors contemplate the various aspects of the technology in any number of claim forms. For example, while only one aspect of the technology is recited as a means-plus-function claim under 35 U.S.C. §112, other aspects may likewise be embodied as a means-plus-function claim. Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the technology.

The above detailed description of examples of the technology is not intended to be exhaustive or to limit the invention to the precise form disclosed above. While specific embodiments of, and examples for, the invention are described above for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. For example, while processes or blocks are presented in a given order, alternative embodiments may perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternative or subcombinations. Each of these processes or blocks may be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed in parallel, or may be performed at different times.

The teachings of the technology provided herein can be applied to other systems, not necessarily the system described above. The elements and acts of the various embodiments described above can be combined to provide further examples. Any patents and applications and other references noted above, including any that may be listed in accompanying filing papers, are incorporated herein by reference. Aspects of the invention can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet further examples of the technology.

These and other changes can be made to the invention in light of the above Detailed Description. While the above description describes certain embodiments of the invention, and describes the best mode contemplated, no matter how detailed the above appears in text, the invention can be practiced in many ways. Details of the system and method for classifying and transferring information may vary considerably in its implementation details, while still being encompassed by the invention disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the invention should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the invention with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification, unless the above Detailed Description section explicitly defines such terms. Accordingly, the actual scope of the invention encompasses not only the disclosed embodiments, but also all equivalent ways of practicing or implementing the technology under the claims. While certain aspects of the technology are presented below in certain claim forms, the inventors contemplate the various aspects of the technology in any number of claim forms. For example, while only one aspect of the technology is recited as embodied in a computer-readable medium, other aspects may likewise be embodied in a computer-readable medium. Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the technology.

From the foregoing, it will be appreciated that specific embodiments of the invention have been described herein for purposes of illustration, but that various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except as by the appended claims. 

We claim:
 1. A computer-readable medium encoded with instructions for controlling a computer system to perform a set of data management requests, by a method comprising: receiving a list of data management requests to be performed by a data management system, wherein each data management request is to perform at least one data management operation requesting data management resources; attempting to perform a first data management request from the list of data management requests by assigning the first data management request to one or more data management resources; and if the first data management request fails, then: identifying at least one data management resource at least partially responsible for the failure, determining a category associated with the at least one data management resource at least partially responsible for the failure, wherein each data management resource is associated with a category, identifying other data management requests in the list of data management requests that request data management resources having the same category as the category associated with the at least one data management resource at least partially responsible for the failure, and updating the received list of data management requests to indicate that the data management system is not to perform the identified other data management requests.
 2. The computer-readable medium of claim 1 wherein a category is created for each data management resource, common to some of the data management requests in the list, that is likely to be critical to the performance of a given data management request.
 3. The computer-readable medium of claim 1 wherein the identifying of the other data management requests in the list comprises determining that a threshold is passed by a likeness factor among data management requests requesting similar data management resources.
 4. The computer-readable medium of claim 1 wherein identifying other data management requests in the list comprises determining whether the first data management request failed because of a data management resource that is common among the data management requests in the list, and when the first data management request fails because of a common data management resource, identifying other data management requests in the list requesting data management resources having the same category as the common data management resource.
 5. The computer-readable medium of claim 1 wherein identifying other data management requests in the list comprises determining whether the first data management request failed because of a data management resource that is unique among the data management requests in the list, and when the first data management request fails because of a unique data management resource, then not identifying certain data management requests in the list requesting data management resources having the same category as the unique resource.
 6. A method comprising: receiving a list of data management requests to be performed by a data management system, wherein each data management request is to perform at least one data management operation requesting data management resources, and wherein each data management request is associated with a category; attempting to perform a first data management request from the list of data management requests by assigning the data management request to one or more data management resources; and if the first data management request fails, then: identifying at least one data management resource that is at least partially responsible for the failure, identifying other data management requests in the list of data management requests having the same category as the first data management request, and removing the identified other data management requests from the received list of data management requests to be performed by the data management system.
 7. The method of claim 6 wherein the removing the identified other data management requests from the received list occurs without attempting to perform any of the identified other data management requests after the first data management request fails.
 8. The method of claim 6 wherein a category is created for each data management resource that is determined to be common among a plurality of data management requests in the list and that is likely to be critical to the performance of a given data management request.
 9. The method of claim 6 wherein the identifying of the other data management requests comprises determining that a threshold is passed by a likeness factor among the other data management requests, based on the other data management requests requesting similar data management resources.
 10. The method of claim 6 wherein the identifying of the other data management requests is based on determining that a threshold is passed by a likeness factor among the other data management requests, wherein the likeness factor considers commonality of a subset of data management resources requested by each respective data management request in the list.
 11. The method of claim 6 wherein the identifying of the other data management requests comprises determining whether the first data management request failed because of a data management resource that is common among the other data management requests, and when the first data management request fails because of a common data management resource, identifying the other data management requests having the same category as the first data management request.
 12. The method of claim 6 wherein when the first data management request failed because of a data management resource that is common among the other data management requests, identifying the other data management requests having the same category as the first data management request.
 13. The method of claim 6 wherein the identifying of the other data management requests comprises determining whether the first data management request failed because of a data management resource that is unique to the first data management request, and when the first data management request fails because of a unique resource, then declining to identify certain data management requests in the list of data management requests that have the same category as the first data management request and do not request the unique data management resource.
 14. The method of claim 6 wherein when the first data management request failed because of a data management resource that is unique to the first data management request, declining to identify certain data management requests in the list of data management requests that have the same category as the first data management request and do not request the unique data management resource.
 15. The method of claim 6 wherein when the first data management request failed because of a resource that is unique to the first data management request, declining to remove the identified other data management requests from the received list of data management requests to be performed by the data management system even though the identified other data management requests have the same category as the failed first data management request.
 16. The method of claim 6 wherein the category is associated with a storage policy.
 17. A data management system comprising: resources for performing data storage requests; a server; wherein the server comprises one or more storage policies, and further comprises a queue of data storage requests to be performed in the data management system, wherein each data storage request in the queue is associated with one of the one or more storage policies, wherein a respective associated storage policy comprises criteria for performing the data storage request, and wherein each data storage request in the queue requires resources for performing the respective data storage request; and wherein the server is configured to: classify each data storage request in the queue into one of a plurality of categories, attempt to perform a first data storage request according to a respective associated storage policy, and if the attempted first data storage request fails, remove from the queue, without attempting to perform, other data storage requests classified into the same category as the attempted first data storage request.
 18. The data management system of claim 17 wherein a given data storage request is classified into a given category based on at least one of (a) resources required by the given data storage request, and (b) rules from the respective storage policy associated with the given data storage request.
 19. The data management system of claim 17 wherein two or more data storage requests in the queue are classified into the same category if they require one or more resources in common.
 20. The data management system of claim 17 wherein two or more data storage requests in the queue are classified into different categories if a likeness factor that considers which resources the two or more data storage requests have in common fails to pass a threshold. 